Optimal giving up: pilot results summary

Data

Here we compare different pilot versions of the optimal stopping experiment. We are trying to replicate the result from Costermans et al. (1992) in which reaction time on omission trials (trials where no response is given) increases with feeling-of-knowing (FOK) reports. We want to show this effect in our paradigm, replacing the FOK report with a memory strength measure based on earlier recall performance.

We’ve run many versions of the experiment with mixed results; this report summarizes all of them. The critical dimension appears to be the instructions, given in full below. In short, the “leading instructions” encourage participants to skip trials in two ways:

  1. We instruct: “So if you don’t think you know the word, it might be best to quickly skip the trial to get the time bonus and avoid the error penalty.”
  2. We include a quiz question, “If you don’t know the word you should…”, whose correct answer is “guess a random word”.

We’ve also run a pilot with a higher bonus rate (5¢ vs. 3¢ per correct response) and one with no time incentive on the critical trials, but these changes don’t seem to make as much of a difference.

Response types

Below is the distribution of response types on the critical trials, in aggregate and by participant, before any exclusions.

response_colors = scale_colour_manual(values=list(
    other="deeppink2",
    intrusion="deeppink4",
    empty="gray",
    correct="springgreen4"
), aesthetics=c("fill", "colour"), name="")

all_trials %>%
    count(pilot, response_type) %>%
    pivot_wider(names_from=response_type, values_from=n) %>%
    replace(is.na(.), 0) %>%
    pivot_longer(!pilot, names_to="response_type", values_to="n") %>%
    group_by(pilot) %>% mutate(prop = n/sum(n)) %>% ungroup() %>%
    ggplot(aes(pilot, prop, fill=fct_rev(response_type))) +
    geom_bar(stat="identity") +
    response_colors +
    labs(x="", y="Proportion of Trials") + coord_flip()

all_trials %>% 
    count(pilot, wid, response_type) %>% 
    pivot_wider(names_from=response_type, values_from=n) %>% 
    replace(is.na(.), 0) %>% 
    mutate(wid=fct_reorder(wid, 100*correct + empty)) %>% 
    pivot_longer(!c(pilot, wid), names_to="response_type", values_to="n", names_prefix="") %>%
    group_by(pilot, wid) %>% mutate(prop = n/sum(n)) %>% ungroup() %>% 
    ggplot(aes(wid, prop, fill=fct_rev(response_type))) +
    geom_bar(stat="identity") +
    response_colors +
    labs(x="Participant", y="Proportion of Trials") + 
    scale_x_discrete(breaks=NULL) +
    facet_wrap(~pilot, scales="free_x")

We see that:

  • there are few errors (incorrect responses) in all pilots
  • empty responses are common in all pilots
  • there are more correct responses when the critical trials were more heavily incentivized (5¢ vs 3¢)
  • there are more empty responses with leading instructions

Exclusions

Participants who always or never skip (give an empty response) don’t provide useful data. We therefore exclude participants who don’t have at least 4 skip trials and at least 4 non-skip trials out of the 37 critical trials. Note that we end up excluding more participants under the leading instructions, because more of those participants skip almost every trial.

all_trials$name = 'Human'

excl = all_trials %>% 
    group_by(pilot,wid) %>%
    summarise(n_skip = sum(skip)) %>% 
    mutate(keep=between(n_skip, 4, 33))

excl %>%
    mutate(excluded = if_else(keep, "included", "excluded")) %>% 
    ggplot(aes(pilot, fill=excluded)) + 
    geom_bar() +
    ylab("Number of Participants") +
    scale_colour_manual(values=c(
        "gray", "dodgerblue"
    ), aesthetics=c("fill", "colour"), name="") + coord_flip() + xlab("")

keep = excl %>% filter(keep) %>% with(wid)

trials = all_trials %>% 
    filter(wid %in% keep) %>% 
    group_by(wid) %>% 
    mutate(rt_z = zscore(rt)) %>% 
    ungroup()

pretest = pretest %>% filter(wid %in% keep)
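
The `zscore` helper used in these pipelines is not a base-R or tidyverse function. A minimal definition consistent with how it’s used here (an assumption; the actual helper is defined elsewhere in our codebase):

```r
# Assumed definition of the zscore helper used throughout this report:
# standardize a vector to mean 0 and SD 1, ignoring missing values.
zscore = function(x) (x - mean(x, na.rm=TRUE)) / sd(x, na.rm=TRUE)

zscore(c(1, 2, 3))  # -1, 0, 1
```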

pretest %>% 
    filter(block == max(block)) %>% 
    rename(pre_correct = correct) %>% 
    mutate(pre_logrt = if_else(pre_correct, log(rt), 0)) %>% 
    group_by(wid, word) %>% 
    summarise(across(c(pre_correct, pre_logrt), mean)) %>%
    group_by(wid) %>%
    mutate(across(c(pre_correct, pre_logrt), zscore, .names="{.col}_z")) %>% 
    mutate(
        raw_strength = -((1-pre_correct) * log(15000) + pre_logrt),
        strength = zscore(raw_strength)
    ) %>% 
    right_join(trials) -> trials

Subjective judgements

The most direct replication of Costermans et al.: How does reaction time depend on explicit reports of confidence and feeling of knowing, given after a response is made?

Confidence for correct responses

How confident are you in your response?

Press a number between 1 and 5.

1   I am not at all sure my response is correct
2   I am not so sure my response is correct
3   I am more or less sure my response is correct
4   I am nearly sure my response is correct
5   I am absolutely sure my response is correct
trials %>% #plot
    filter(response_type == "correct") %>% 
    regress(judgement, rt) +
    xlab("Confidence Judgement")

pilot                  estimate  std.error  p.value
high bonus            -142.4327    70.6379   0.0784
leading instructions   -90.3989    35.6747   0.0133
no time bonus         -492.4245   189.7270   0.0318
standard              -216.7514    89.5107   0.0506

All in the right direction. The effect is much larger when speed is not incentivized.

FOK for empty responses

How much do you feel that you know the word?

Press a number between 1 and 5.

1   I am absolutely sure I do not know the word
2   I am rather sure I do not know the word
3   I have a vague impression I know the word
4   I am rather sure I know the word
5   I am absolutely sure I know the word
trials %>%  #plot
    filter(skip) %>% 
    regress(judgement, rt, bins=0, bin_range=1) +
    stat_summary(fun.data=mean_cl_boot, size=.2) +
    xlab("Feeling of Knowing Judgement")

pilot                  estimate  std.error  p.value
high bonus             314.4393   110.9809   0.0345
leading instructions   193.8678    60.0537   0.0134
no time bonus         1125.4998   318.6457   0.0053
standard               270.2153   169.8991   0.2695

All in the right direction!

Conclusion: We consistently replicate the Costermans finding with explicit judgements. Results are strongest when speed is not incentivized.

Objective memory strength measure

We can now ask the same thing, using performance on the pretest as an objective measure of the strength of each memory. The strength of each pair is defined as the negative average log reaction time on the two pretest exposures, where an inaccurate response counts as the maximum reaction time.
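
As a self-contained illustration of this definition (base R; the 15,000 ms value is the response deadline used as the penalty RT, matching the `log(15000)` term in the analysis code above):

```r
# Raw strength of a word: negative mean log RT over its pretest exposures,
# with an incorrect response counted as the maximum RT (15000 ms).
raw_strength = function(correct, rt, max_rt=15000) {
    logrt = ifelse(correct, log(rt), log(max_rt))
    -mean(logrt)
}

raw_strength(c(TRUE, TRUE), c(800, 1200))  # fast, accurate recall: high strength
raw_strength(c(TRUE, FALSE), c(3000, NA))  # one miss: lower
raw_strength(c(FALSE, FALSE), c(NA, NA))   # two misses: the floor, -log(15000)
```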

trials %>% #plot
    filter(response_type == "correct") %>% 
    regress(strength, rt)

pilot                  estimate  std.error  p.value
high bonus            -160.2295    42.8961   0.0003
leading instructions  -248.5228    58.3524   0.0002
no time bonus         -427.9684   125.2094   0.0038
standard              -276.4942   107.0522   0.0205

We consistently see faster responses for higher-strength cues. This is unsurprising.

Reaction time on empty trials

Here’s the critical effect.

trials %>% #plot
    filter(skip) %>% 
    regress(strength, rt) + 
    coord_cartesian(ylim=c(0, 4000))

pilot                 estimate  std.error  p.value
high bonus           -126.1001   146.0841   0.4337
leading instructions  112.0613    40.2186   0.0181
no time bonus         -57.8315   216.7666   0.8229
standard              116.5285    91.0194   0.2251

The effect only comes out with leading instructions. In two pilots, the effect was actually negative (although given the size of the standard errors, we can’t take this too seriously).

The effect is a bit more robust when we z-score strength and reaction time within participant (specifically, within each participant’s skip trials), but we still only see it with leading instructions.

trials %>% #plot
    filter(skip) %>% 
    group_by(wid) %>% 
    filter(sd(strength) != 0) %>% 
    mutate(strength=zscore(strength), rt_z = zscore(rt)) %>% 
    regress(strength, rt_z) +
    coord_cartesian(ylim=c(-1.5, 1.5))

pilot                 estimate  std.error  p.value
high bonus              0.0820     0.1320   0.5960
leading instructions    0.1864     0.0513   0.0015
no time bonus          -0.0443     0.0818   0.5887
standard                0.0433     0.1039   0.6898

This is not just a quirk of one sample. I actually ran three identical versions of the “leading instructions” experiment, and all three of them show a strong effect, with no other sample showing an effect close to the same size (with z-scoring):

trials %>%
    filter(skip) %>% 
    group_by(wid) %>% 
    filter(sd(strength) != 0) %>% 
    mutate(strength=zscore(strength), rt_z = zscore(rt)) %>% 
    group_by(version) %>% 
    group_modify(function(data, grp) {
        lmer(rt_z ~ strength + (strength|wid), data=data) %>% tidy
    }) %>% 
    filter(term == "strength") %>% 
    select(version, estimate, std.error, p.value) %>% 
    left_join(
        load_data('participants') %>% count(version)
    ) %>% kable(digits=4)

version  estimate  std.error  p.value   n
v6.5       0.2062     0.1261   0.1502  10
v6.5B      0.2264     0.0933   0.0169  10
v6.5C      0.1735     0.0703   0.0311  20
v6.6       0.0841     0.1354   0.5960  10
v6.7       0.0442     0.1061   0.6898  20
v6.8      -0.0454     0.0838   0.5887  20

What’s going on here?

Why do we only get the strength effect with the leading instructions?

Individual regressions

Our first clue comes from looking at the effect separately for each participant:

trials %>% 
    group_by(wid) %>% 
    filter(skip) %>% 
    ggplot(aes(strength, rt, group=wid)) + 
    geom_smooth(method="lm", level=0, size=.5, color="black") +
    facet_wrap(~pilot) +
    pretty_labs("strength", "rt")

In the leading instructions group, the slope varies considerably, but most people are in the right direction. But why are there so few lines in the other groups? It turns out that many participants in those conditions have no variance in strength on skip trials, which means we can’t run a regression for them. This happens when a participant only skips words they answered incorrectly on both pretest trials, so that every skipped word has the minimum possible strength value.
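
To see why such participants drop out, here is a toy base-R example: with a constant predictor, `lm` cannot estimate a slope and returns `NA` for it.

```r
# All of this hypothetical participant's skipped words sit at the minimum
# strength, so the predictor has zero variance.
d = data.frame(
    strength = rep(-log(15000), 5),
    rt = c(900, 1100, 1000, 1200, 950)
)
coef(lm(rt ~ strength, data=d))["strength"]  # NA: no variance in the predictor
```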

Here’s a table summarizing the number and proportion of participants who skip on at least one trial where strength is above the minimum (note that the count matches the number of lines in the regression plot above).

trials %>% 
    group_by(pilot, wid) %>% 
    filter(skip) %>% 
    summarise(x=mean(pre_correct==0))  %>% 
    group_by(pilot) %>% 
    summarise(n=sum(x != 1), prop=mean(x != 1)) %>% kable(digits=2)

pilot                  n  prop
high bonus             5  0.56
leading instructions  25  0.89
no time bonus          8  0.53
standard               9  0.56

Metacognitive threshold

What seems to be happening is that the skipping threshold is a bit lower (more lenient) in the leading instructions group. We can see this in an aggregate logistic regression:

trials %>% 
    ggplot(aes(strength, 1*skip, color=pilot)) +
    geom_smooth(method = "glm", method.args = list(family = "binomial"), formula=y~x) +
    pretty_labs("strength", "probability of skipping")

It’s even clearer if we show individuals’ curves, using the raw (un-normalized) strength measure. The red line marks the minimum possible strength, corresponding to two incorrect pretest responses.

trials %>% 
    ggplot(aes(raw_strength, 1*skip, group=wid)) +
    geom_smooth(method = "glm", method.args = list(family = "binomial"), formula=y~x, 
        se=F, color="black", size=.5) +
    facet_wrap(~pilot) +
    theme(
        panel.grid.major.x = element_line(color="gray"),
        panel.grid.major.y = element_line(color="gray")
    ) + geom_vline(xintercept=-log(15000), color="red2") +
    pretty_labs("raw strength", "probability of skipping")

Where does that leave us? It seems that the effect does exist, but that it is obscured when people only skip minimum-strength words. There are a few strategies we could take:

  1. Use the leading instructions. This is the surest thing, but it might draw suspicion. Personally, I think it’s fine to encourage participants a bit here, as the effect still depends on their having the metacognitive ability. But I can imagine this causing problems in review.
  2. Use the standard instructions, exclude participants who only skip minimum-strength trials and plan to collect a very large sample.
  3. Try to lower people’s skipping threshold in some other way. For example, we could use a stricter error penalty.
  4. Use a strength measure that is more sensitive at the low end. For example, we could try using 2AFC.

Conclusion: We get the critical result of slower skipping for higher-strength targets, but only when we encourage skipping in the instructions. The critical difference seems to be that these instructions lower the skipping threshold, giving us a wider range of strength on skip trials.

Metacognitive accuracy

A secondary question is the extent to which the metacognitive judgements correlate with the objective memory strength measure. Overall, it looks like they do.

trials %>% 
    filter(skip | correct) %>% 
    group_by(wid) %>% 
    filter(sd(judgement) != 0) %>% 
    mutate(judgement=zscore(judgement), strength=zscore(strength)) %>% 
    ggplot(aes(judgement, strength, color=judgement_type)) +
    geom_point(size=.5) +
    geom_smooth(method="lm", se=F) +
    facet_wrap(~pilot) +
    labs(x="Judgement (z-scored)", y="Strength (z-scored)")

However, when we look at just the FOK trials, we don’t actually see a consistent relationship within individuals. In the plot, each thin line is one participant and the thick line is a mixed-effects fit.

trials %>% #plot
    filter(skip) %>% 
    regress(judgement, strength, bins=0, bin_range=1) +
    # stat_summary(fun.data=mean_cl_boot, size=.2) +
    stat_summary(aes(group=wid), fun=mean, size=.2, geom="line") +
    xlab("Feeling of Knowing Judgement")

pilot                 estimate  std.error  p.value
high bonus              0.0027     0.0200   0.8944
leading instructions    0.0278     0.0286   0.3582
no time bonus          -0.0022     0.0222   0.9230
standard                0.2295     0.1527   0.1829

Here are the regression coefficients from a model predicting strength from the FOK judgement, fit separately to each participant (excluding those who always give the same FOK rating).

fok = trials %>% 
    filter(judgement_type == "fok") %>%
    group_by(wid) %>% 
    filter(sd(judgement) != 0)

X = fok %>% 
    group_modify(function(data, grp) {
        lm(strength ~ judgement, data=data) %>% tidy(conf.int=T)
    }) %>% 
    filter(term == "judgement")  %>% 
    left_join(select(excl, wid, pilot)) %>% 
    arrange(estimate) %>% 
    ungroup() %>% 
    mutate(wid=fct_reorder(wid, estimate))

X %>% 
    ggplot(aes(estimate, wid, xmin=conf.low, xmax=conf.high, color=pilot)) +
    geom_pointrange() + geom_vline(xintercept=0) +
    theme(legend.position="top")

Very few participants show a positive relationship. It looks like part of the reason for this is that participants are not using the full scale:

trials %>% 
    filter(skip) %>% 
    ggplot(aes(judgement, ..prop..)) +
    geom_bar() +
    facet_wrap(~pilot) +
    labs(x="Feeling of Knowing Judgement", y="Proportion of Trials")

Conclusion: We don’t see a consistent relationship between the FOK judgement and memory strength. However, it seems likely that this is because people aren’t able to use the scale in a meaningful way, rather than because they really don’t have any sense of their memory strength.

Miscellaneous

Pretest accuracy

pretest %>%
    group_by(pilot, wid) %>% 
    summarise(accuracy=mean(correct)) %>% 
    ggplot(aes(pilot, accuracy)) +
    geom_quasirandom(color="gray") +
    stat_summary(fun.data=mean_cl_boot) +
    coord_flip() +
    xlab("") +
    ylim(0, 1)

Critical trials reaction time

trials %>%
    mutate(response_type = case_when(
        response_type == "correct" ~ "correct",
        response_type == "empty" ~ "empty",
        TRUE ~ "error"
    )) %>% 
    ggplot(aes(pilot, rt, color=response_type)) +
    # geom_quasirandom(color="gray") +
    stat_summary(fun.data=mean_cl_boot) +
    coord_flip() +
    scale_colour_manual(values=c(
        "springgreen4",
        "gray", 
        "deeppink3"
    ), aesthetics=c("color"), name="") +
    xlab("")

Instructions screenshots

Standard instructions

Leading instructions